Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-463: Add more types - time, nano timestamps, UUID to Variant spec #464

Merged
merged 5 commits into from
Dec 10, 2024

Conversation

aihuaxu
Copy link
Contributor

@aihuaxu aihuaxu commented Oct 29, 2024

Rationale for this change

Iceberg tables support time, nano timestamps, UUID types and currently Variant spec doesn't include those. Propose to add those missing types.

What changes are included in this PR?

The spec change.

Fix #463

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Oct 29, 2024

cc @gene-db, @rdblue.

@@ -391,6 +391,10 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian |
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that iceberg has added timestamp_ns in the V3 spec. From the Parquet side, we'd better be more generic. Should we consider parameterized timestamp type like what Trino does for Variant type: https://trino.io/docs/current/language/types.html#timestamp-p?

Copy link
Contributor

@Fokko Fokko Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much agree here. This is also in line with the LogicalTypeAnnotation in parquet-java.

Suggested change
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
| Timestamp | timestamp | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |

Since this hasn't been published yet, I would also propose:

  • Group these with the other timestamp/time types.
  • Remove the NTZ, as it is similar to the timeunit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. I will try to make the name change to align with Parquet annotation The only thing is that we need to have separate type ID for ltz/ntz and micro/nano seconds. so it would be cleaner to have them into separate entries rather than grouping them with parameter.

@aihuaxu aihuaxu requested review from Fokko and wgtmac November 1, 2024 16:41
@@ -386,11 +386,15 @@ The Decimal type contains a scale, but no precision. The implied precision of a
| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume some of these changes were weren't' intended, these lines shouldn't be changed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH nvm, I see what's going on here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually maybe I don't, "timestamp" isn't a physical type in parquet correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no Date or timestamp for physical type. I'm wondering if this column should mean type category. @gene-db ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).

| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are trying to correct this it seems like timetamp without time zone should be the logical type and the physical type is int64?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this duplicates the the question @RussellSpitzer asked abve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is answered in the paragraph below. Although "behaves" the same is ambiguous. I'm not sure if the intent here Is if parquet has flexibility to change the physical type. We should make sure we clarify this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just saw the explanation below for logical type.
I think it means: for a string, the engines can choose to encode differently but when it's read, they should be the same.

I updated the logical types for newly added types. Since engines may choose different precisions to encode, they have the same logical type from the paragraph below.

| Timestamp            | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian                                                                        |
| Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                                        |

| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logical type should indicate precision here (and aboe for the other timestamps)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually reading the paragraph below I guess we don't need precision but do need nts

| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
| UUID | uuid | `24` | UUID | 16 bytes |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably note there the encoding order (even though it is duplicative)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 16-byte big-endian.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once we iron out wording on timestamp we can merge.

@aihuaxu aihuaxu requested a review from emkornfield November 26, 2024 05:10
@aihuaxu aihuaxu force-pushed the aixu-add-more-variant-types branch from 45420e7 to de25a28 Compare November 26, 2024 05:14
@emkornfield
Copy link
Contributor

This LGTM, @RussellSpitzer any more comments.

Also, CC @gene-db @rdblue in case there are any concerns.

@emkornfield
Copy link
Contributor

Will merge end of week if there aren't more comments.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @emkornfield.

VariantEncoding.md Outdated Show resolved Hide resolved
VariantEncoding.md Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor

@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet.

@aihuaxu
Copy link
Contributor Author

aihuaxu commented Dec 9, 2024

@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet.

Yeah. Totally agree that we should avoid using logical type. I was thinking of using "type category" or "type group" before to avoid overload. But "Type Equivalence Class" also works for me.

aihuaxu and others added 2 commits December 8, 2024 16:20
Co-authored-by: emkornfield <[email protected]>
Co-authored-by: emkornfield <[email protected]>
@emkornfield
Copy link
Contributor

Going to merge. Thanks @aihuaxu

@emkornfield emkornfield merged commit a3dda6a into apache:master Dec 10, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add more types - time, nano timestamps, UUID to Variant spec
5 participants